Toward self-documenting, customizable microbiome analysis and metaanalysis with scikit-bio, QIIME 2 and Qiita.

Greg Caporaso and Rob Knight

Caporaso Lab, Northern Arizona University

Knight Lab, University of California San Diego

QIIME

Cited 2969 times (as of 16 July 2015) since publication.

Nearly monthly workshops, somewhere in the world.

The next generation of QIIME: scikit-bio, QIIME 2 and Qiita.
- Simplified installation.
- Browser-based interface will replace/supplement the command line interface.
- Workflow transparency.
- Plugin framework, so other developers can build on QIIME.

scikit-bio

Low-level, a bioinformatics library for data scientists, students, educators, and bioinformatics software developers.

Free, Sloan-funded companion text, An Introduction to Applied Bioinformatics available at readIAB.org.

QIIME 2

A stable and well-documented API and command line interface.

scikit-bio

Low-level, a bioinformatics library for data scientists, students, educators, and bioinformatics software developers.

Free, Sloan-funded companion text, An Introduction to Applied Bioinformatics available at readIAB.org.

Qiita

End-user, browser-based environment for microbiome analysis and meta-analysis.

QIIME 2

A stable and well-documented API and command line interface.

scikit-bio

Low-level, a bioinformatics library for data scientists, students, educators, and bioinformatics software developers.

Free, Sloan-funded companion text, An Introduction to Applied Bioinformatics available at [readIAB.org

A scikit is a python scientific computing library that cleanly integrates with others.

scikit-bio is built on top of scipy, numpy, pandas, etc.

This enables:

beautiful visualizations with seaborn
powerful machine learning and optimization with scikit-learn
fast sequence collapsing with marisa-trie and related packages
statistical modeling with Stats.Models
whatever comes next...

scikit-bio 0.4.0, our first beta release, is now available

Developing your bioinformatics software with scikit-bio gets you a lot of functionality for free.

See Jai Rideout and Evan Bolyen's SciPy 2015 talk for more detail and cool demos http://bit.ly/skbio-scipy2015

(We'll tweet the link with #microbenet.)

Example: building a better taxonomy classifier in under 100 lines of code



In [ ]:

    
from skbio import DNA
import numpy as np
import skbio

## Load Greengenes and slice all sequences to the region 
## amplified by 515F/806R. 

aligned_seqs_fp = 'data/gg_13_8_otus/rep_set_aligned/82_otus.fasta'
taxonomy_fp = 'data/gg_13_8_otus/taxonomy/82_otu_taxonomy.txt'


fwd_primer = DNA("GTGCCAGCMGCCGCGGTAA",
                 metadata={'label':'fwd-primer'})
rev_primer = DNA("GGACTACHVGGGTWTCTAAT",
                 metadata={'label':'rev-primer'}).reverse_complement()

def seq_to_regex(seq):
    result = []
    for base in str(seq):
        if base in DNA.degenerate_chars:
            result.append('[{0}]'.format(
                ''.join(DNA.degenerate_map[base])))
        else:
            result.append(base)

    return ''.join(result)

regex = '({0}.*{1})'.format(seq_to_regex(fwd_primer),
                            seq_to_regex(rev_primer))

starts = []
stops = []
for seq in skbio.io.read(aligned_seqs_fp, format='fasta', 
                         constructor=DNA):
    for match in seq.find_with_regex(regex, ignore=seq.gaps()):
        starts.append(match.start)
        stops.append(match.stop)
        
locus = slice(int(np.median(starts)), int(np.median(stops)))

## Get all kmers for all sequences.
kmer_counts = []
seq_ids = []
for seq in skbio.io.read(aligned_seqs_fp, format='fasta',
                         constructor=DNA):
    seq_ids.append(seq.metadata['id'])
    sliced_seq = seq[locus].degap()
    kmer_counts.append(sliced_seq.kmer_frequencies(8))
    
    
## Load the training/test data, perform feature selection, and train 
## and test the classifier using cross-validation.
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectPercentile
from sklearn.svm import SVC
from sklearn.cross_validation import train_test_split

X = DictVectorizer().fit_transform(kmer_counts)

taxonomy_level = 3 # class
id_to_taxon = {}
with open(taxonomy_fp) as f:
    for line in f:
       id_, taxon = line.strip().split('\t')
       id_to_taxon[id_] = '; '.join(taxon.split('; ')[:taxonomy_level])

y = [id_to_taxon[seq_id] for seq_id in seq_ids]

X = SelectPercentile().fit_transform(X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)
y_pred = SVC(C=10, kernel='linear', degree=3,
             gamma=0.001).fit(X_train, y_train).predict(X_test)

## Define a function to use for plotting. 
%matplotlib inline
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.figure()
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    plt.ylabel('Known taxonomy')
    plt.xlabel('Predicted taxonomy')
    plt.tight_layout()
    plt.show()

Support Vector Machine genus classifier

Naive Bayes genus classifier

QIIME 2

Self-documenting analyses simplify bioinformatics methods reporting, and therefore reproducibility and transparancy.